layer153.pkl
220---------- Welcome to Pure-FTPd [privsep] [TLS] ----------
220-You are user number 2 of 5 allowed.
220-Local time is now 10:18. Server port: 21.
220-This is a private system - No anonymous login
220-IPv6 connections are also welcome on this server.
220 You will be disconnected after 15 minutes of inactivity.
USER alBERT
331 User alBERT OK. Password required
PASS dBASE
230 OK. Current directory is /
CWD .
250 OK. Current directory is /
TYPE I
200 TYPE is now 8-bit binary
PASV
227 Entering Passive Mode (127,0,0,1,117,49)
RETR layer153.pkl
150-Accepted data connection
150 2304.2 kbytes to download
226-File successfully transferred
226 0.001 seconds (measured here), 1529.63 Mbytes per second
import pickle
# unpickle data from model.pkl
with open('model.pkl', 'rb') as f:
    clf = pickle.load(f)
# print clf
print(clf)
0.pkl is very large, 93MB
.pkl now
1.pkl to 403.pkl
#!/bin/bash
for i in {1..404..2}
do
tshark -r Capture.pcapng -Y usb -z follow,tcp,raw,$i > session_$i.pkl
done
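As a quick sanity check on the extraction (a minimal sketch; the glob pattern and the expectation of 202 odd-numbered streams are assumptions based on the loop above):
import glob
import numpy as np
# count the extracted data streams and confirm one of them deserializes to a numpy array
files = sorted(glob.glob("session_*.pkl"),
               key=lambda p: int(p.split("_")[1].split(".")[0]))
print(len(files))  # expect 202 files (streams 1, 3, ..., 403)
arr = np.load(open(files[0], "rb"), allow_pickle=True)
print(type(arr), arr.dtype, arr.shape)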
>>> res = ""
>>> for i in pks:
...     xd = max(i)
...     if xd > 0.5:
...         res += "1"
...     else:
...         res += "0"
0.pkl to 201.pkl: loaded the pickles and noticed that in each numpy array the numbers are either all close to 0 or all close to 1. Then if I use "0" to represent all close to 0, and "1" for all close to 1, I got a binary string
0001000000000100000100000000010000010000000001000001000000000100000100000000010000010000000001000001000000000100000100000000010000010000000001000001000000000100000100000000010000010000000001000001000010
But nothing seems related to the flag. I am pretty sure it's correct, but I did not see the flag inside the binary string.
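For reference, the kind of check behind that conclusion might look like this (a sketch, assuming res is the bit string built above; reading it as 8-bit ASCII does not give printable text):
# group the bit string (one bit per array) into bytes and try to decode it
chunks = [res[i:i+8] for i in range(0, len(res) - len(res) % 8, 8)]
print(bytes(int(c, 2) for c in chunks))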
import numpy as np
t=np.load('xxx',allow_pickle=True)
print(t)
allow_pickle=True
>>> for i in pks:
...     print(len(i))
...
23440896
393216
1536
768
768
589824
768
589824
768
589824
768
589824
768
768
768
2359296
3072
2359296
768
768
768
589824
768
589824
768
589824
768
589824
768
768
768
2359296
3072
...
589824
768
768
768
pks = []
for i in range(1, 404, 2):
    file = f"session_{i}.pkl"
    t = np.load(open(file, "rb"), allow_pickle=True)
    pks.append(t)
>>> pks[0]
array([ 3.0280282e-03, -1.7906362e-03, 5.7056175e-05, ...,
-1.7809691e-02, 3.6876060e-02, 1.3254955e-02], dtype=float32)
>>> pks[3]
array([0.99998367, 0.9996013 , 1.0005598 , 1.0016987 , 0.99919254,
1.0002677 , 1.0010219 , 1.0004221 , 0.9998995 , 1.0002527 ,
1.0002414 , 0.99942666, 1.0006638 , 0.99949586, 1.0005087 , ...
An example of ~0 and ~1 (edited)
pkl files (layer0.pkl to layer201.pkl) are attached in the above zip mlm.zip
pks is a 2D array of numpy arrays; each numpy array is the weights of a given layer (0 to 201)
I play CTFs ****, output is weekly.
sahuang — Today at 11:01 PM
Does it make sense to assume input is masked flag, output is the masked part(i.e. the text inside flag format?)
Aymen — Today at 11:02 PM
You're on the right track, try to read more about how MLM works and how you could use it to get the flag
pks, 3) feed it flag format Cyber...{***(MASKED)***}, output is probably the word they want
sahuang — Today at 11:20 PM
Some layers provided have array dimension much larger than 768, which is BERT dimension per layer, I guess that's something I should sort out?
Aymen — Today at 11:22 PM
Yes
sahuang
Any possible hint on Misc/MLM on layer dimension? There are a lot of layers with dimensions much larger than 768 (though multiples of it), which cannot be added to the default BERT model (but the hint said use all default configs)
Plus, default BERT has 12 layers only.
Aymen
Default bert layer dimensions are well known, you can reconstruct these from the given arrays
Think about how can u do it!
sahuang
Do you mean some sort of average on pooling layers technique?
e.g. take avg of a 2x2 and consider it as a weight value
Aymen
No you won't need that
Can I have a look at your code?
sahuang
I wrote some code to get all 202 arrays, each having a different size
23440896, 393216, 1536, 768, 768, 589824...
Then I loaded a default BERT model (following some online tutorial) and tried to add layers to it, but only the 768-dimension ones could be added, which is why I had that question
Aymen
Would be nice if you check number of layers of bert model and dimensions of each layer, this would definitely help u! (edited)
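Aymen's suggestion can be checked directly with the standard Hugging Face API (a minimal sketch; nothing here is challenge-specific):
from transformers import BertConfig, BertForMaskedLM
m = BertForMaskedLM(config=BertConfig())
print(m.config.num_hidden_layers, m.config.hidden_size, m.config.intermediate_size)  # 12 768 3072
print(len(list(m.parameters())))  # number of weight tensors -- this should line up with the 202 pkl files
for name, p in list(m.named_parameters())[:8]:
    print(tuple(p.shape), name)  # e.g. (30522, 768) bert.embeddings.word_embeddings.weight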
Guesslemonger — Today at 11:58 AM
hi, for mlm we have 202 layers whereas default BERT uses 12 layers, also dimension of many layers is over 768. Is the idea here to narrow down 202 layers to 12 layers first?
all with 768 dimension
Ouxs — Today at 12:15 PM
@Aymen
Guesslemonger — Today at 12:50 PM
author offline?
Guesslemonger — Today at 1:00 PM
ok so I researched a bit more, started with 0 knowledge of this. are these 202 pkl files for context? since default BERT won't know what to do with CyberErudites{[MASK]}
so we are trying to expand the vocabulary basically
Aymen — Today at 1:00 PM
Indeed BERT does have only 12 layers, but if you take a look at each layer you'll find that each one consists of query, key, value, dropout, ..
Guesslemonger — Today at 1:02 PM
is this the idea?
Aymen — Today at 1:03 PM
vocab has already been expanded
Guesslemonger — Today at 1:04 PM
umm so these files are some components of the layer which we can change
so that model identifies flag format
Aymen — Today at 1:04 PM
you're not asked to change anything, only reconstruction
Guesslemonger — Today at 1:05 PM
so we have a default BERT model, we reconstruct it using these 202 files?
Aymen — Today at 1:05 PM
you're on the right track!
>>> from transformers import pipeline
>>> model = pipeline('fill-mask', model='bert-base-uncased')
>>> pred = model("What is [MASK] name?")
>>> pred
[{'score': 0.5362833738327026, 'token': 2115, 'token_str': 'your', 'sequence': 'what is your name?'}, {'score': 0.260379433631897, 'token': 2014, 'token_str': 'her', 'sequence': 'what is her name?'}, {'score': 0.14665310084819794, 'token': 2010, 'token_str': 'his', 'sequence': 'what is his name?'}, {'score': 0.036417704075574875, 'token': 2026, 'token_str': 'my', 'sequence': 'what is my name?'}, {'score': 0.004835808649659157, 'token': 2049, 'token_str': 'its', 'sequence': 'what is its name?'}]
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained("bert-base-uncased")
print(model.num_parameters) (edited)
so basically every parameter of model is given as a separate pickle file
and need to merge them to create bert model
<bound method ModuleUtilsMixin.num_parameters of BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
first part of output is word_embeddings, 30522 * 768 = 23440896
and length of first pkl file is indeed that
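That check can be extended to every array (a sketch, assuming pks holds the arrays in the same order as model.parameters(), which is what the reconstruction below relies on):
from transformers import BertConfig, BertForMaskedLM
ref = BertForMaskedLM(config=BertConfig())
# every pkl length should equal the flattened size of the matching parameter
for arr, (name, p) in zip(pks, ref.named_parameters()):
    assert len(arr) == p.numel(), (name, len(arr), p.numel())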
from transformers import BertModel, BertConfig, BertTokenizer, BertForMaskedLM
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM(config=BertConfig())
print(model.parameters) (edited)
(word_embeddings): Embedding(30522, 768, padding_idx=0)
Do you know how to do the rest?
from transformers import BertModel, BertConfig, BertTokenizer, BertForMaskedLM
import torch
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM(config=BertConfig())
shapes = []
for j, param in enumerate(model.parameters()):
    if j == 0:
        print(param.data)
    shapes.append(param.shape)
for j, param in enumerate(model.parameters()):
    # update param to our weights
    # if 2d, need to reshape pks[j]
    if len(shapes[j]) == 2:
        param.data = torch.from_numpy(pks[j]).view(shapes[j])
    else:
        param.data = torch.from_numpy(pks[j])
for j, param in enumerate(model.parameters()):
    if j == 0:
        print(param.data)
    assert param.shape == shapes[j]
Use this code to get it (pks is the array of pkl data)
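An equivalent way to do the same assignment (a sketch, not the approach used above) is to assemble a state_dict from pks and load it in one call:
import torch
# build name -> tensor pairs in parameter order, reshaping the flat arrays
state = {name: torch.from_numpy(arr).view(p.shape)
         for (name, p), arr in zip(model.named_parameters(), pks)}
model.load_state_dict(state, strict=False)  # strict=False: buffers and tied weights keep their defaults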
from transformers import BertTokenizer, BertForMaskedLM
from torch.nn import functional as F
import torch
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased', return_dict = True)
text = "The capital of France, " + tokenizer.mask_token + ", contains the Eiffel Tower."
input = tokenizer.encode_plus(text, return_tensors = "pt")
mask_index = torch.where(input["input_ids"][0] == tokenizer.mask_token_id)
output = model(**input)
logits = output.logits
softmax = F.softmax(logits, dim = -1)
mask_word = softmax[0, mask_index, :]
top_10 = torch.topk(mask_word, 10, dim = 1)[1][0]
for token in top_10:
    word = tokenizer.decode([token])
    new_sentence = text.replace(tokenizer.mask_token, word)
    print(new_sentence)
to use input
from https://towardsdatascience.com/how-to-use-bert-from-the-hugging-face-transformer-library-d373a22b0209
from torch.nn import functional as F
from transformers import BertModel, BertConfig, BertTokenizer, BertForMaskedLM
import torch
import pickle, numpy as np
pks = [] # store all the weights
for i in range(1, 404, 2):
    file = f"session_{i}.pkl"
    t = np.load(open(file, "rb"), allow_pickle=True)
    pks.append(t)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM(config=BertConfig())
shapes = []
for j, param in enumerate(model.parameters()):
    shapes.append(param.shape)
for j, param in enumerate(model.parameters()):
    # update param to our weights
    # if 2d, need to reshape pks[j]
    if len(shapes[j]) == 2:
        param.data = torch.from_numpy(pks[j]).view(shapes[j])
    else:
        param.data = torch.from_numpy(pks[j])
text = "CyberErudites{" + tokenizer.mask_token + "}"
input = tokenizer.encode_plus(text, return_tensors = "pt")
mask_index = torch.where(input["input_ids"][0] == tokenizer.mask_token_id)
output = model(**input)
logits = output.logits
softmax = F.softmax(logits, dim = -1)
mask_word = softmax[0, mask_index, :]
top_10 = torch.topk(mask_word, 10, dim = 1)[1][0]
for token in top_10:
    word = tokenizer.decode([token])
    new_sentence = text.replace(tokenizer.mask_token, word)
    print(new_sentence)
Full code
for j, param in enumerate(model.parameters()):
    # update param to our weights
    # if 2d, need to reshape pks[j]
    if len(shapes[j]) == 2:
        param.data = torch.from_numpy(pks[j]).view(shapes[j])
    else:
        param.data = torch.from_numpy(pks[j])
Everything other than comment is done by copilot
torch.from_numpy(pks[j]).view(shapes[j])
from torch.nn import functional as F
from transformers import BertModel, BertConfig, BertTokenizer, BertForMaskedLM
import torch
import pickle, numpy as np
pks = [] # store all the weights
for i in range(1, 404, 2):
    file = f"session_{i}.pkl"
    t = np.load(open(file, "rb"), allow_pickle=True)
    pks.append(t)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM(config=BertConfig())
shapes = []
for j, param in enumerate(model.parameters()):
    shapes.append(param.shape)
for j, param in enumerate(model.parameters()):
    # update param to our weights
    # if 2d, need to reshape pks[j]
    if len(shapes[j]) == 2:
        param.data = torch.from_numpy(pks[j]).view(shapes[j])
    else:
        param.data = torch.from_numpy(pks[j])
flag = ''
while not flag.endswith('}'):
    text = flag + tokenizer.mask_token
    input = tokenizer.encode_plus(text, return_tensors = "pt")
    mask_index = torch.where(input["input_ids"][0] == tokenizer.mask_token_id)
    output = model(**input)
    logits = output.logits
    softmax = F.softmax(logits, dim = -1)
    mask_word = softmax[0, mask_index, :]
    top_10 = torch.topk(mask_word, 10, dim = 1)[1][0]
    word = tokenizer.decode([top_10[0]])
    new_sentence = text.replace(tokenizer.mask_token, word)
    flag = new_sentence.replace('##','')
print(flag) (edited)
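As a follow-up, the reconstructed model can also be wrapped in the fill-mask pipeline shown earlier (a sketch, assuming model and tokenizer are the rebuilt objects from the script above):
from transformers import pipeline
pipe = pipeline('fill-mask', model=model, tokenizer=tokenizer)
print(pipe("CyberErudites{" + tokenizer.mask_token + "}"))  # top predictions for the masked token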